Explainability - by Example

This section demonstrates, through practical examples, how explainability can be implemented for both traditional and generative AI models.

Linear Regression Model

Problem Statement & Solution

Problem Statement: Inaccurate and inconsistent valuation of residential real estate properties results in significant financial risks for investors, buyers, and sellers. Current valuation methods rely heavily on expert judgment and limited data, leading to potential overvaluation or undervaluation.

Solution: Develop a Linear Regression model to accurately predict future market values of residential real estate properties based on a comprehensive dataset of property and market characteristics.

Training Data Details

MedInc: Median income for households within a block of houses, in tens of thousands of US dollars [10k$]
HouseAge: Median age of a house within a block; a lower number means a newer building [years]
AveRooms: Average number of rooms per household within a block
AveBedrms: Average number of bedrooms per household within a block
Population: Total number of people residing within a block
AveOccup: Average number of household members
Latitude: How far north a house is; a higher value is farther north [°]
Longitude: How far west a house is; a higher value is farther west [°]
MedHouseVal: Median house value for households within a block, in US dollars [$] (prediction target)

Global explainability in ML models refers to the ability to understand and interpret the overall behavior and decision-making process of a machine learning model across all its predictions, rather than just individual instances. It provides insights into how different features contribute to the model's predictions on average. Following are a few model-agnostic techniques that identify important features influencing model predictions.

Kernel SHAP

Kernel SHAP analysis indicates that income and occupancy are the top features affecting the model's predictions. This means these factors play a significant role in explaining how the model arrives at its conclusions. In simpler terms, changes in income and occupancy levels have the largest impact on predicting the future market value of residential real estate properties.

[Figure: Kernel SHAP global feature importance for the house-value model]
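A minimal sketch of how such a Kernel SHAP analysis can be produced with the `shap` package on the California Housing dataset (which matches the feature table above). The split and sample sizes here are illustrative choices, not the exact configuration behind the figure.

```python
import shap
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load the dataset described in the feature table and fit the model.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# KernelExplainer is model-agnostic: it only needs a predict function and a
# background sample used to marginalize out "absent" features.
background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(model.predict, background)
shap_values = explainer.shap_values(X_test.iloc[:50])

# Global view: mean |SHAP value| per feature across the explained instances.
shap.summary_plot(shap_values, X_test.iloc[:50], plot_type="bar")
```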

Local explainability in ML models refers to the ability to understand and interpret the decision-making process of a model for specific instances or predictions. Unlike global explainability, which focuses on the overall behavior of the model, local explainability provides insights into how different features contribute to a model's prediction for a particular input. This helps users understand why the model made a specific decision and allows for greater trust and transparency in the model's outputs.

LIME (Local Interpretable Model-agnostic Explanations)

LIME analysis shows that income, location, and occupancy are the most influential features for our model. This indicates that changes in these factors significantly impact the predictions for residential real estate market values. In simpler terms, variations in income, where a property is located, and how many people live in it play a crucial role in determining its future value.

[Figure: LIME explanation for a single house-value prediction]
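A comparable LIME explanation for a single prediction, reusing the model and data from the Kernel SHAP sketch above; `num_features=5` is an illustrative choice.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    mode="regression",
)

# Explain one test instance: which features pushed this particular
# prediction up or down, and by how much.
explanation = explainer.explain_instance(
    X_test.iloc[0].values, model.predict, num_features=5
)
print(explanation.as_list())
```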

Other Explainer Methods

SHAP (SHapley Additive exPlanations)

SHAP analysis reveals that income, location, and occupancy have the highest feature importances. This suggests these features significantly contribute to explaining the model's predictions. In simpler terms, variations in these features have the greatest influence on predicting the future market value of residential real estate properties.

[Figure: SHAP feature importance for the house-value model]
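For a linear model, SHAP values can also be computed exactly with a model-specific explainer, which is much faster than Kernel SHAP; a minimal sketch, again reusing the model and data from the first example.

```python
# LinearExplainer exploits the model's linear form instead of sampling.
explainer = shap.LinearExplainer(model, X_train)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```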

Feature Importance

Permutation Importance

By randomly shuffling the values of one feature at a time, permutation importance measures the impact on model performance. The high importance of income, location, and occupancy indicates that shuffling their values significantly degrades the model's predictions, which suggests the model relies on these features to make accurate decisions.

[Figure: Permutation importance for the house-value model]
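A minimal sketch of permutation importance with scikit-learn, reusing the fitted model and held-out data from the earlier examples.

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature n_repeats times and record the average drop in score.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
ranked = sorted(zip(X_test.columns, result.importances_mean), key=lambda t: -t[1])
for name, score in ranked:
    print(f"{name}: {score:.4f}")
```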

Variable Importance

Partial Dependence Variance

Decomposing the model's predictions by income, location, house age, and occupancy reveals high partial dependence variance for these features. This signifies that the model's average prediction varies substantially as their values change; in simpler terms, each of these features independently exerts a strong influence on the model's output.

[Figure: Partial dependence variance for the house-value model]
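One simple formulation of this technique (following Greenwell et al.) scores each feature by the spread of its partial dependence curve. The sketch below uses scikit-learn's `partial_dependence` with the model and data from the earlier examples, and is not necessarily the exact estimator behind the figure.

```python
from sklearn.inspection import partial_dependence

for i, name in enumerate(X_train.columns):
    # Average prediction as a function of feature i, marginalizing the rest.
    pd_result = partial_dependence(model, X_train, features=[i])
    curve = pd_result["average"][0]
    # A flat curve means low influence; a widely varying curve means high.
    print(f"{name}: std of partial dependence = {curve.std():.4f}")
```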

Binary Classifier Model

Problem Statement & Solution

Problem Statement: The problem at hand is to accurately predict whether an individual is at risk of developing heart disease. This prediction can be made by analyzing various health-related factors, demographic information, and medical history.

Solution: Develop a Binary Classification model to accurately predict individuals who are at risk of heart disease so that preventive measures can be implemented to improve their health outcomes.

Training Data Details

age: Age of the individual in years.
sex: Gender of the individual (1 = male, 0 = female).
chest_pain_type: Type of chest pain experienced (categorical, 0-3).
resting_blood_pressure: Resting blood pressure (in mm Hg) measured when the individual is at rest.
serum_cholesterol: Serum cholesterol level (in mg/dl).
fasting_blood_sugar: Fasting blood sugar level (1 if > 120 mg/dl, otherwise 0).
resting_ecg_results: Resting electrocardiographic results (categorical, 0-2).
max_heart_rate_achieved: Maximum heart rate achieved during exercise (in bpm).
exercise_induced_angina: Exercise-induced angina (1 = yes, 0 = no).
oldpeak: ST depression induced by exercise relative to rest.
slope: Slope of the peak exercise ST segment (categorical, 0-2).
number_of_vessels_fluro: Number of major vessels (0-3) colored by fluoroscopy.
thalassemia: Thalassemia status (1 = normal, 2 = fixed defect, 3 = reversible defect).
is_disease: Denotes whether the individual has heart disease (1 = yes, 0 = no) (prediction target).

Global explainability, as described earlier, characterizes how different features contribute to the model's predictions on average, across all instances. The following model-agnostic techniques identify the features that most influence this classifier's predictions.

Kernel SHAP

Global Kernel SHAP analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features across all predictions in our heart disease prediction model. This indicates that these factors consistently affect the risk assessment for heart disease across the entire dataset. In simpler terms, the overall patterns suggest that a person's gender, the type of chest pain they experience, and the number of major vessels affected are key indicators in evaluating heart disease risk.

[Figure: Kernel SHAP global feature importance for the heart disease model]
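A hedged sketch of how this global Kernel SHAP analysis could be reproduced. Here `clf`, `X_train`, and `X_test` are assumptions: a fitted binary classifier with `predict_proba` and train/test frames built from the feature table above (no training code is shown in this section).

```python
import shap

def predict_disease(data):
    # Explain the probability of heart disease (class 1).
    return clf.predict_proba(data)[:, 1]

background = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(predict_disease, background)
shap_values = explainer.shap_values(X_test.iloc[:50])
shap.summary_plot(shap_values, X_test.iloc[:50], plot_type="bar")
```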

Local explainability, as described earlier, explains how features contribute to the model's prediction for a particular input, supporting trust and transparency in individual decisions.

LIME (Local Interpretable Model-agnostic Explanations)

LIME analysis reveals that sex, chest pain type, and the number of vessels colored by fluoroscopy are the most influential features for our heart disease prediction model. This suggests that changes in these factors significantly impact the model's predictions regarding an individual's risk of developing heart disease. In simpler terms, variations in a person's gender, the type of chest pain they experience, and how many major vessels are affected are critical in assessing their likelihood of having heart disease. This highlights the importance of these features in understanding individual risk profiles.

[Figure: LIME explanation for a single heart disease prediction]
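The corresponding LIME sketch for a single patient, under the same assumed `clf`, `X_train`, and `X_test`.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values,
    feature_names=list(X_train.columns),
    class_names=["no disease", "disease"],
    mode="classification",  # uses predict_proba under the hood
)
explanation = explainer.explain_instance(
    X_test.iloc[0].values, clf.predict_proba, num_features=5
)
print(explanation.as_list())
```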

Other Explainer Methods

The following explainer methods are illustrated on a second binary classification problem: predicting employee attrition.

Problem Statement: The problem at hand is to accurately predict whether an employee is likely to leave an organization. This prediction can be made by analyzing various factors related to the employee's demographics, job satisfaction, and work environment.

Solution: Develop a Binary Classification model to accurately predict employees who are at risk of attrition so that proactive measures can be taken to retain them.

Training Data Details

num_production_digital_project_changes_last_12_months: Number of changes made to a production-level digital project within the last 12 months.
pct_time_non_revenue_last_12_months: Percentage of time the employee spent on non-revenue-generating activities (e.g., administrative tasks, meetings) in the past 12 months.
emp_experience_diff_average_team_leadership_experience_last_9_months: Difference between the employee's experience and the average leadership experience of their team over the past 9 months.
num_promotions_in_past_2_years: Number of promotions the employee has received in the past 2 years.
emp_experience_diff_average_team_experience_3_vs_9_months: Difference between the employee's experience and the average team experience at two time points: 3 and 9 months ago.
num_production_project_changes_last_6_months: Number of changes or updates made to production projects in the last 6 months.
num_production_project_changes_last_9_months: Number of changes or updates made to production projects in the last 9 months.
education_level_bachelors: Indicates whether the employee has a bachelor's degree.
education_level_masters: Indicates whether the employee has a master's degree.
is_attrited: Denotes whether the employee has left the organization (prediction target).

SHAP (SHapley Additive exPlanations)

SHAP analysis assigns high importance to bench time, average cohort rating, education level, and number of projects changed. This indicates that these features make the largest marginal contributions to the model's predictions. In other words, variations in these features most strongly shift the model's output for a specific prediction.

[Figure: SHAP feature importance for the attrition model]

Partial Dependence Variance

Partial dependence variance analysis identifies high importance for bench time, average cohort rating, education level, and number of projects changed. This means changes in these features lead to significant variation in the model's average prediction. In simpler terms, each of these features independently has a strong effect on the model's output.

[Figure: Partial dependence variance for the attrition model]

Anchor

Anchor analysis identifies a rule, i.e., a set of feature conditions, that locally "anchors" a model prediction: as long as the rule holds, the prediction is almost always the same. For the explained instance, the anchor is bench time exceeding 44.44 in the last 9 months combined with a non-positive change in average team leadership rating between month 3 and month 2.

[Figure: Anchor explanation for the attrition model]
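One common implementation of anchors is `AnchorTabular` from the `alibi` library; a sketch under an assumed fitted attrition classifier `clf` and feature frames `X_train`/`X_test` (these names are illustrative).

```python
from alibi.explainers import AnchorTabular

explainer = AnchorTabular(clf.predict, feature_names=list(X_train.columns))
explainer.fit(X_train.values)

# An anchor is a set of feature conditions under which the model's
# prediction stays (almost) always the same.
explanation = explainer.explain(X_test.iloc[0].values, threshold=0.95)
print("Anchor:", " AND ".join(explanation.anchor))
print(f"Precision: {explanation.precision:.2f}, coverage: {explanation.coverage:.2f}")
```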

Time Series Forecasting Model

Problem Statement & Solution

Problem Statement: Insufficient inventory can result in lost sales due to unavailability of products. Overstocking can tie up capital and increase holding costs.

Solution: Develop a time series forecasting model to accurately predict weekly sales. Accurate sales forecasting is essential for effective business operations, financial stability, and customer satisfaction; failure to forecast sales can lead to inventory issues, financial difficulties, operational inefficiencies, missed market opportunities, and reduced customer satisfaction.

Training Data Details

Store: Store number
Dept: Department number
IsHoliday: Whether the week is a special holiday week
Type: Store type (A, B, or C), assigned by size; almost half of the stores are larger than 150,000 and categorized as A
Size: Store size
Temperature: Average temperature in the region
Fuel_Price: Cost of fuel in the region
MarkDown1-MarkDown5: Anonymized data related to promotional markdowns that Walmart is running
CPI: The consumer price index
Unemployment: The unemployment rate
Day, Week, Month, Quarter, Year: Calendar components of the observation date
Weekly_Sales: Sales for the given department in the given store (prediction target)

Explanation:

LIME: The LIME analysis identifies store size, holiday weeks, regional fuel price, month, and quarter as the factors most pivotal to the model's decision-making when forecasting weekly sales from the historical time series data.

[Figure: LIME explanation for the weekly sales forecast]
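Because the forecaster consumes tabular features (store, department, markdowns, CPI, calendar fields), the same `LimeTabularExplainer` pattern applies. In this sketch, `forecaster`, `X_train`, and `X_test` are assumed stand-ins for the trained model and its engineered feature frames.

```python
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train.values, feature_names=list(X_train.columns), mode="regression"
)
explanation = explainer.explain_instance(
    X_test.iloc[0].values, forecaster.predict, num_features=6
)
print(explanation.as_list())  # e.g., Size, IsHoliday, Fuel_Price, Month, Quarter
```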

Chain of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Chain-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by breaking a complex task down into a coherent series of logical deductions.

Explanation:

Prompt: What is the largest river in India?

Chain of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Chain-of-thought reasoning for the prompt]
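A minimal sketch of zero-shot chain-of-thought prompting; `llm` stands for any caller-supplied function that sends a prompt to the model and returns its text response.

```python
def chain_of_thought(question: str, llm) -> str:
    # The "step by step" suffix is the standard zero-shot CoT trigger that
    # elicits intermediate reasoning before the final answer.
    return llm(f"{question}\nLet's think step by step, then give the final answer.")

# Example: chain_of_thought("What is the largest river in India?", llm)
```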

Thread of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Thread-of-thought reasoning mirrors human reasoning. It works through long or unstructured contexts in manageable parts, summarizing and analyzing each part step by step before producing an answer.

Explanation:

Prompt: What is the largest river in India?

Thread of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Thread-of-thought reasoning for the prompt]
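A sketch of the thread-of-thought trigger phrase (as proposed by Zhou et al., 2023); `llm` is a caller-supplied model call as before.

```python
def thread_of_thought(context: str, question: str, llm) -> str:
    # ThoT asks the model to traverse a long or chaotic context in parts,
    # summarizing and analyzing each part before answering.
    return llm(
        f"{context}\nQ: {question}\n"
        "Walk me through this context in manageable parts step by step, "
        "summarizing and analyzing as we go."
    )
```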

ReRead Reasoning

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: ReRead (RE2) reasoning mirrors human re-reading. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), which aim to elicit the reasoning process in the output, RE2 shifts the focus to the input by processing the question twice, thereby enhancing understanding. Consequently, RE2 demonstrates strong generality and compatibility with most thought-eliciting prompting methods.

Explanation:

Prompt: What is the largest river in India?

ReRead reasoning provides detailed, step-by-step reasoning for the LLM's response to the above prompt after re-reading the question.

[Figure: ReRead reasoning for the prompt]
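A sketch of the RE2 prompt, which simply repeats the question so the model processes the input twice; it composes naturally with output-side triggers such as chain of thought.

```python
def reread(question: str, llm) -> str:
    # RE2 shifts the focus to the input: the question is read twice.
    return llm(f"{question}\nRead the question again: {question}")
```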

Graph of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Graph-of-thought reasoning mirrors human reasoning. It facilitates systematic problem-solving by modeling intermediate thoughts as nodes in a graph, so reasoning paths can branch and merge rather than follow a single chain.

Explanation:

Prompt: What is the largest river in India?

Graph of thought provides detailed, step-by-step reasoning for the LLM's response to the above prompt.

[Figure: Graph-of-thought reasoning for the prompt]
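A toy sketch of the graph structure itself: unlike a chain, thoughts can branch from the question and later merge, which is what distinguishes a graph of thoughts from a chain or tree. The node texts are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class Thought:
    text: str
    parents: list["Thought"] = field(default_factory=list)

question = Thought("What is the largest river in India?")
branch_a = Thought("Interpret 'largest' as the longest river.", parents=[question])
branch_b = Thought("Interpret 'largest' as the greatest discharge.", parents=[question])
# Merging two branches into one node: impossible in a chain, natural in a graph.
answer = Thought("Aggregate both interpretations before answering.",
                 parents=[branch_a, branch_b])
```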

Chain of verification

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: Chain of verification helps users understand an LLM response by checking the baseline answer against a series of verification questions and answers.

Verification:

Prompt: What is the largest river in India?

Chain of verification asks the LLM five verification questions grounded in the context of the original reasoning and, based on the five answers, derives the final verified answer.

[Figure: Chain-of-verification output for the prompt]
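A sketch of the chain-of-verification loop described above, with `llm` again standing for a caller-supplied model call; the five-question count mirrors the description.

```python
def chain_of_verification(question: str, llm, n_questions: int = 5) -> str:
    baseline = llm(question)
    # 1. Plan verification questions that probe the baseline answer.
    plan = llm(
        f"Question: {question}\nBaseline answer: {baseline}\n"
        f"Write {n_questions} verification questions that check this answer."
    )
    questions = [q for q in plan.splitlines() if q.strip()][:n_questions]
    # 2. Answer each verification question independently of the baseline.
    answers = [llm(q) for q in questions]
    qa = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    # 3. Derive the final answer from the verification evidence.
    return llm(
        f"Question: {question}\nBaseline answer: {baseline}\n"
        f"Verification Q&A:\n{qa}\nGive the final, corrected answer."
    )
```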

Token Importance

Problem Statement & Solution

Problem Statement: In AI language models, the importance of tokens can significantly influence the generated responses. Understanding which tokens (words or phrases) are most impactful can be crucial for interpreting and trusting the model's decisions. This is particularly relevant in applications where precise and reliable outputs are essential, such as in healthcare, finance, and legal domains.

Solution: Token Importance helps in understanding how different tokens contribute to the AI model's responses. By analyzing the relative importance of tokens, users can gain insight into which parts of the input significantly affect the model’s output. This can enhance transparency and trust in the AI system's decision-making process.

Explanation:

Prompt: What is the largest river in India?

  1. Displays a matrix of the top 10 tokens and their importance scores

  2. The Token Importance Distribution Chart illustrates the significance of individual tokens by displaying the distribution of their associated impact scores. The chart's shape reveals the following insights:

    • Flat Distribution: Tokens have similar importance, with no clear standout

    • Left-Peaked Distribution: Tokens have low impact scores, indicating lesser importance

    • Right-Peaked Distribution: Tokens have high impact scores, signifying greater importance

  3. Displays the importance of each token (the top 10 tokens by importance are shown in this chart).

[Figure: Token importance matrix, distribution chart, and per-token importance]
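One common way to estimate token importance is occlusion (leave-one-token-out). The sketch below assumes a hypothetical `score_response` function that returns the model's confidence for a given prompt; it illustrates the idea rather than the exact method behind the charts above.

```python
def token_importance(prompt: str, score_response, top_k: int = 10):
    tokens = prompt.split()
    base = score_response(prompt)  # hypothetical confidence for the full prompt
    scores = {}
    for i, tok in enumerate(tokens):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        # A large drop when the token is removed means the token mattered.
        scores[tok] = base - score_response(ablated)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]

# Example: token_importance("What is the largest river in India?", score_response)
```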

Search Augmentation

Problem Statement & Solution

Problem Statement: When users cannot discern how AI systems reach their conclusions, it can undermine trust in the technology. This issue is particularly pressing in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences. Ensuring the accuracy and transparency of AI responses is crucial for maintaining user confidence.

Solution: The Chain of Verification through Internet Search offers a systematic approach to validate the accuracy of an AI response by cross-referencing it with multiple reliable online sources. This method involves querying different authoritative sources to confirm the correctness of the AI's answer and provide clarity on how the response was derived.

Explanation:

Prompt: What is the largest river in India?

  1. Internet Search displays the final response after cross-validating the facts generated by thread of thought against internet search results.

  2. Lists the facts used by thread of thought while reasoning about the LLM response.

  3. Provides an explanation based on the internet search results.

  4. Gives a judgement on the accuracy of the LLM response. The validation process compares the LLM's thread of thought against internet search results:

    • No: Internet search results contradict the LLM's facts, indicating potential inaccuracies.

    • Yes: Internet search results support the LLM's facts, confirming their validity.

    • Unclear: Internet search results lack sufficient information to determine the accuracy of the LLM's response, requiring further investigation.

[Figure: Internet search validation of the LLM response]
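A sketch of this validation loop, with hypothetical `llm` and `search` callables standing in for the model client and the web search API.

```python
def validate_with_search(facts, llm, search) -> str:
    # For each fact from the thread of thought, retrieve evidence and ask
    # for a Yes / No / Unclear judgement, then aggregate to a final verdict.
    verdicts = []
    for fact in facts:
        evidence = search(fact)
        verdicts.append(llm(
            f"Fact: {fact}\nSearch evidence: {evidence}\n"
            "Does the evidence support the fact? Answer Yes, No, or Unclear."
        ).strip())
    if any(v.startswith("No") for v in verdicts):
        return "No"       # search results contradict at least one fact
    if any(v.startswith("Unclear") for v in verdicts):
        return "Unclear"  # insufficient information; needs investigation
    return "Yes"          # all facts supported
```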

Logic of Thought

Problem Statement & Solution

Problem Statement: When users cannot understand how AI systems reach their decisions, it can erode trust in the technology. This is particularly important in critical domains like healthcare, finance, and criminal justice, where decisions can have significant consequences.

Solution: The logic-of-thought technique encompasses a variety of methodologies for understanding and analyzing human thinking and reasoning. In essence, it applies formalized techniques for logical reasoning (such as deductive reasoning and critical thinking). The purpose of these techniques is to enhance clarity of thought, ensure decisions are based on sound logic, and improve the cognitive processes we use to navigate the world.

Explanation:

Prompt: Which is the largest river in India?

The Logic of Thought (LoT) extracts propositions and logical expressions, extending them to generate expanded logical information from the input context. This generated logical information is then utilized as an additional augmentation to the input prompts, thereby enhancing the system's logical reasoning capabilities.

[Figure: Logic-of-thought output for the prompt]
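A simplified sketch of the LoT expansion step, applying only the law of contraposition to illustrative extracted implications before injecting them back into the prompt; full LoT applies a wider set of logical laws.

```python
def logic_of_thought(question: str, implications, llm) -> str:
    # Each implication is a (premise, conclusion) pair extracted from context.
    expanded = list(implications)
    for p, q in implications:
        expanded.append((f"NOT ({q})", f"NOT ({p})"))  # contrapositive
    logic_context = "\n".join(f"IF {p} THEN {q}" for p, q in expanded)
    # Augment the prompt with the expanded logical information.
    return llm(f"{question}\nLogical information:\n{logic_context}")
```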

Evaluation Metrics

Problem Statement & Solution

Problem Statement: Defining "correct" or "good" explanations can be ambiguous and context-dependent. There are no standardized, universally accepted metrics to objectively quantify the quality of explanations generated by LLMs, their overall performance, or the user's satisfaction with the interaction.

Solution: Recent research on LLM explanation, QUEST (Quality of Information, Understanding and Reasoning, Expression Style and Persona, Safety and Harm, and Trust and Confidence), proposes evaluating LLM explanations with metrics such as uncertainty, relevance, coherence, language tone, and sentiment. With the help of prompt engineering, we ask the LLM to score these metrics and to explain each score.

Metrics:

Prompt: What is the largest river in India?

Uncertainty quantification and coherence score are the two evaluation metrics we have implemented to quantify the quality of the explanations generated by the LLM. A high coherence score shows that the answer is logically aligned with the actual query, while a low uncertainty score indicates that the LLM has high confidence in its answer.

Coherence
  Less Coherent: >=0 and <=30
  Moderately Coherent: >30 and <=70
  Highly Coherent: >70 and <=100

Certainty
  Highly Certain: >=0 and <=30 (less uncertainty)
  Moderately Certain: >30 and <=70 (moderately uncertain)
  Less Certain: >70 and <=100 (highly uncertain)
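
These bands can be applied programmatically; a minimal sketch.

```python
def coherence_band(score: float) -> str:
    # Maps a 0-100 coherence score onto the bands listed above.
    if score <= 30:
        return "Less Coherent"
    return "Moderately Coherent" if score <= 70 else "Highly Coherent"

def certainty_band(uncertainty: float) -> str:
    # Lower uncertainty means higher model confidence in the answer.
    if uncertainty <= 30:
        return "Highly Certain"
    return "Moderately Certain" if uncertainty <= 70 else "Less Certain"

print(coherence_band(82), "/", certainty_band(12))  # Highly Coherent / Highly Certain
```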

[Figure: Coherence and certainty evaluation for the prompt]

Note: Few examples, datasets and graphs used in this section are sourced from publicly available information and are attributed to their respective creators.